-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Goodput & Badput recording and monitoring support. #783
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I don't have access to this link. Can you provide an example that we can take a look |
Please feel free to "re-request review" when ready. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
Co-authored-by: Ruoming Pang <[email protected]>
Co-authored-by: Ruoming Pang <[email protected]>
axlearn/cloud/gcp/measurement.py
Outdated
if not self._monitor: | ||
# This could happen if there are internal errors (such as access errors) from GCP services such as Cloud Logging or Cloud Storage. | ||
logging.log_first_n( | ||
logging.WARNING, | ||
"Goodput upload could not be started. Please check GoodputMonitor logs.", | ||
1, | ||
) | ||
self._monitor.start_goodput_uploader(*args, **kwargs) | ||
logging.info("Started Goodput upload to Tensorboard in the background!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not self._monitor: | |
# This could happen if there are internal errors (such as access errors) from GCP services such as Cloud Logging or Cloud Storage. | |
logging.log_first_n( | |
logging.WARNING, | |
"Goodput upload could not be started. Please check GoodputMonitor logs.", | |
1, | |
) | |
self._monitor.start_goodput_uploader(*args, **kwargs) | |
logging.info("Started Goodput upload to Tensorboard in the background!") | |
if self._monitor: | |
self._monitor.start_goodput_uploader(*args, **kwargs) | |
logging.info("Started Goodput upload to Tensorboard in the background!") | |
else: | |
# This could happen if there are internal errors (such as access errors) from GCP services such as Cloud Logging or Cloud Storage. | |
logging.log_first_n( | |
logging.WARNING, | |
"Goodput upload could not be started. Please check GoodputMonitor logs.", | |
1, | |
) |
So that we check that self._monitor
is valid before invoking start_goodput_uploader
.
BTW, it's still unclear to me how self._monitor
can be None after we construct the instance of GoodputMonitor
. Are we missing a try/catch somewhere? Does the __init__
method of GoodputMonitor
raise an exception (that seems a bit unexpected)?
fv.mark_as_parsed() | ||
|
||
recorder = GoodputRecorder.from_flags(fv) | ||
recorder._recorder = mock.MagicMock() | ||
recorder.record(measurement.Event.START_JOB) | ||
self.assertTrue(recorder._recorder.record_job_start_time.called) | ||
|
||
def test_start_monitoring(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test the failure scenario?
@@ -47,6 +59,10 @@ def record(self, event: Event, *args, **kwargs): | |||
"""Records an event with the given name.""" | |||
raise NotImplementedError(type(self)) | |||
|
|||
def start_monitoring(self, **kwargs): | |||
"""Starts computing and uploading metrics at some configured interval in the background.""" | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's raise NotImplementedError(type(self))
and let subclasses decide whether to implement -- it should be fairly straightforward for a subclass to decide to not monitor, but we want the decision to be explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, it was originally raising NotImplementedError(type(self))
, but I suggested to change to pass
to avoid breaking subclasses upon axlearn bump.
Do we have tests to catch such breakage?
axlearn/cloud/gcp/measurement.py
Outdated
include_badput_breakdown=True, | ||
) | ||
if not self._monitor: | ||
# This could happen if there are internal errors (such as access errors) from GCP services such as Cloud Logging or Cloud Storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I just triggered the CI, sorry for not doing so early. (I suspect lines like this will fail pylint for being too long.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will re-run precomit checks.
This change adds the following:
Tested: